Singapore has emerged as one of the world’s most prosperous countries. In addition to being a financial center, it's an achievement in urban planning and serves as a model for developing nations.
Public housing in Singapore is subsidized, built, and managed by the Government of Singapore, and the country has one of the world’s highest home ownership rates. More than 80% of the 5.8M population live in Housing and Development Board (HDB) apartments, commonly known as HDB "flats", and more than 90% of those residents own the apartment they live in. The government subsidizes the cost of new homes, and buyers can finance a purchase with a loan from the HDB and a 10% down payment. Singapore’s housing estates are considered mixed-income developments.
With more than 1 million flats spread across 24 towns and 3 estates, Singapore’s public housing operates at an unusual scale. As a geographic reference, Singapore is slightly more than 3.5 times the size of Washington, D.C., yet it houses a population of 5.8M.
Background - In 1960 the Singapore Housing and Development Board (HDB) was formed to provide affordable, high-quality housing for residents of the city-state. Housing is issued by the state on 99-year leaseholds, and the value of a home depends on many factors: inherent utility value of the property, size (square footage), flat type and model, age, location, geographical proximity, etc.
We will examine various factors (including geospatial features) that can be used to predict the flat resale value…
Our goal is to identify the true drivers of the HDB flat resale price, and to create an interactive system to predict these prices.
example of flat:

The primary data for the project was Singapore's HDB Resale Flat Prices (resale transacted prices), which is published by the Singapore Housing and Development Board (HDB) and updated on a weekly basis.
Note that "lease left" refers to the number of years until the 99-year lease expires, after which ownership of the flat returns to the government. This is a very different concept from home ownership in the United States.
Total transaction observations: 867,677 rows (with y features), covering 11,747 days.
Note: A helpful map to get a feel for Singapore in general is provided here
A dataset example below:
Additional sources of data included:
plotted resale price index placeholder:
Our core data is governed by the Singapore Open Data License (https://data.gov.sg/open-data-licence), which aims to "promote and enable easy reuse of Public Sector data to create value for the Singapore area community and businesses".
Under the license terms, we are allowed to use, access, download, copy, distribute, transmit, modify, and adapt the datasets, or any derived analyses or applications. We followed these terms explicitly: we may not use the datasets in a way that suggests any official status, or that a Singapore agency endorses us or our use of the datasets. We also followed the guidance that any application or website using the data must display a conspicuous notice acknowledging the source of the datasets, with a link to the most recent version of the posted license.
Location Data:
After importing the data, we performed conventional Exploratory Data Analysis (EDA) to get a feel for the dataset. Plots were created to show the distribution of the initial features and the value counts for each feature.
Summary statistics were investigated and plotted as well.
We tried to identify anomalies that could be removed.
Plotting average price per square meter against the categorical features was especially informative.
We attempted to remove outliers.
Multiple features were derived mathematically from the original price: price_per_sq_ft, price_per_sq_m, price_per_sq_ft_per_lease_yr, and price_per_sq_m_per_lease_yr.
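A minimal sketch of how these four derived features could be computed for one transaction. The input names (resale_price, floor_area_sqm, remaining_lease_years) are hypothetical, and dividing by the remaining lease years is our assumption about what "per lease year" means; the project's actual formulas may differ.

```python
SQ_FT_PER_SQ_M = 10.7639  # square feet in one square meter


def add_price_features(row):
    """Return a copy of `row` with the four derived price features added.

    `row` is a dict with the hypothetical keys resale_price,
    floor_area_sqm and remaining_lease_years.
    """
    area_sq_m = row["floor_area_sqm"]
    area_sq_ft = area_sq_m * SQ_FT_PER_SQ_M
    lease_years = row["remaining_lease_years"]  # assumed denominator

    out = dict(row)
    out["price_per_sq_m"] = row["resale_price"] / area_sq_m
    out["price_per_sq_ft"] = row["resale_price"] / area_sq_ft
    out["price_per_sq_m_per_lease_yr"] = out["price_per_sq_m"] / lease_years
    out["price_per_sq_ft_per_lease_yr"] = out["price_per_sq_ft"] / lease_years
    return out


example = add_price_features(
    {"resale_price": 500000, "floor_area_sqm": 100, "remaining_lease_years": 80}
)
```

Normalizing by area and by lease years makes flats of different sizes and ages more directly comparable than the raw resale price.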
cleaned features - base dataset - example observation:
A large number of Singapore geospatial location features were consolidated and pushed to the database:
It was then possible to calculate the distances to these nodes via the 'geometry' feature.
Python code was used to query the OneMap API.
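To illustrate the distance step described above, here is a minimal sketch that computes the straight-line (great-circle) distance from a flat to its nearest amenity using the haversine formula. This is a stand-in for illustration only: the project computed distances from the 'geometry' feature, and the function names and coordinates below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_distance_km(flat, amenities):
    """Distance from `flat` (lat, lon) to the closest point in `amenities`."""
    return min(haversine_km(flat[0], flat[1], a[0], a[1]) for a in amenities)


# Hypothetical coordinates, for illustration only.
flat = (1.30, 103.80)
amenities = [(1.31, 103.80), (1.35, 103.90)]
closest = nearest_distance_km(flat, amenities)
```

One such nearest-distance value per amenity type (hospital, station, school, and so on) can then be attached to each flat as a feature.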
Initial Dataset - The initial dataset (prior to embedding location features) was split chronologically and was trained with Linear Regression, XGBoost, and Random Forest Regression models.
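A chronological split can be sketched as follows. The cutoff dates and the "month" key are assumptions chosen for illustration, not the boundaries we actually used; the point is that every training row precedes every validation row, which in turn precedes every test row.

```python
from datetime import date


def chronological_split(rows, val_start, test_start, key="month"):
    """Split rows into train/validation/test by date, with no shuffling.

    Rows dated before `val_start` go to train, rows from `val_start` up to
    (but not including) `test_start` go to validation, the rest go to test.
    """
    train = [r for r in rows if r[key] < val_start]
    val = [r for r in rows if val_start <= r[key] < test_start]
    test = [r for r in rows if r[key] >= test_start]
    return train, val, test


# Hypothetical rows and cutoffs, for illustration only.
rows = [{"month": date(year, 1, 1)} for year in range(2015, 2023)]
train, val, test = chronological_split(
    rows, val_start=date(2019, 1, 1), test_start=date(2021, 1, 1)
)
```

Splitting by time rather than at random avoids leaking future price levels into training, which matters for a market that drifts over the years.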
Evaluation - To evaluate our models, we needed metrics that tell us how accurate the predictions were and how far they deviated from the actual values. To determine how well a model fit the data, we used $R^2$ (the proportion of variance explained), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
The validation dataset was used both for an initial look at performance and for tuning the hyperparameters, while the test dataset was held out to evaluate the final model's performance.
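For reference, the four metrics can be computed from scratch as in the sketch below. This is a plain-Python illustration using the standard definitions, not the code used in the project; the function name and sample values are ours.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


metrics = regression_metrics([2, 4, 6, 8], [3, 3, 7, 7])
```

Because RMSE is the square root of MSE, it is expressed in the same units as the target, which makes it easier to read alongside MAE.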
After extensive tuning, the Random Forest Regression model on the initial dataset obtained a training $R^2$ of insrt, a validation $R^2$ of insrt, and a final test $R^2$ of insrt.
For XGBoost, objective='reg:squarederror' specifies that, because this is a regression problem, training minimizes the squared error. The goal is always to minimize the error, so lower MSE and RMSE values are better.
Final Dataset (including location features) - After the location features were embedded, the final dataset was used to train XGBoost and Random Forest Regression models.
Interpretation - The ability to interpret our models' output and feature importances was important to us. A conventional Linear Regression model has standard coefficients that are easy to understand, but given the complexity of our data we relied on models that lack such coefficients and therefore need additional explainability tools.
Feature importance outputs were available from the Random Forest Regression model (example here).
SHAP (SHapley Additive exPlanations) is an approach for explaining the output of any machine learning model, both graphically and numerically, and it is one of the tools we chose to help explain our model outputs. An example of this output is provided here.
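To give a feel for what a Shapley value is, the sketch below computes exact Shapley values for a toy model by averaging each feature's marginal contribution over every feature ordering. This brute-force approach is only practical for a handful of features and is not how the SHAP library computes values internally; the toy model, inputs, and baseline are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`, relative to `baseline`.

    Features start at their baseline values and are switched to their
    values in `x` one at a time; each feature's contribution is the
    change in prediction, averaged over all orderings.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [p / len(orderings) for p in phi]


def toy_model(v):
    # Hypothetical model with an interaction term between the two features.
    return 2 * v[0] + 3 * v[1] + v[0] * v[1]


phi = shapley_values(toy_model, x=[1, 1], baseline=[0, 0])
```

The contributions always sum to the difference between the prediction at x and the prediction at the baseline, which is what makes the per-feature values additive.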
Overall Observations:
Overall Observations - Base:
Base Model (not including location features):
Feature Importances - The most important features appeared to be:
Initial Dataset - Feature importances for our best baseline random forest regressor are found here, with the following key variables:
# Final Dataset (including location features): Current Model
# --- R2 Scores ---
Train: 0.962
Validation: 0.897
Test: 0.795
# --- MSE ---
Train: 11.153
Validation: 40.907
Test: 71.863
# New Model Output Results:
--- Test Set ---
Mean Absolute Error: ... 6.078747834396562
Mean Squared Error:..... 67.18
RMSE: .................. 8.196381335337657
Coeff of det (R^2):..... 0.809 (1.4 % better)
--- Val Set ---
Mean Absolute Error: ... 4.498476271962494
Mean Squared Error:..... 38.02
RMSE: .................. 6.166026942280251
Coeff of det (R^2):..... 0.904 (0.7 % better)
--- Train Set ---
Mean Absolute Error: ... 2.339212417385334
Mean Squared Error:..... 10.24
RMSE: .................. 3.200258043072929
Coeff of det (R^2):..... 0.965
# New Model Hyperparameters (XGBoost-based)
max_depth=7
min_child_weight=6
gamma = 10
subsample=0.75
colsample_bytree = 0.5
reg_alpha = 100
reg_lambda = 1
n_estimators=800 (can add more if desired)
learning_rate=0.16
seed=42
tree_method='hist'
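As a sketch, the hyperparameters listed above could be collected into a dictionary and passed to XGBoost's scikit-learn style regressor. We assume the XGBRegressor wrapper here, and we pass the seed as random_state, which is the name that wrapper uses; the build_model helper is hypothetical and xgboost is imported only when it is called.

```python
# Values copied from the hyperparameter listing above.
XGB_PARAMS = {
    "objective": "reg:squarederror",  # minimize squared error (regression)
    "max_depth": 7,
    "min_child_weight": 6,
    "gamma": 10,
    "subsample": 0.75,
    "colsample_bytree": 0.5,
    "reg_alpha": 100,
    "reg_lambda": 1,
    "n_estimators": 800,
    "learning_rate": 0.16,
    "random_state": 42,  # listed as `seed` above (assumed equivalent)
    "tree_method": "hist",
}


def build_model(params=None):
    """Create an untrained XGBRegressor; requires the xgboost package."""
    from xgboost import XGBRegressor  # imported lazily on purpose

    return XGBRegressor(**(params or XGB_PARAMS))
```

Keeping the values in one dictionary makes it easy to log them with each run and to vary a single parameter during tuning.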
hyperparameters that seemed to matter:
SHAP final values (re-insert):
Market forces were at play throughout the time range of our data, so resale price predictions will not be perfect.
We chose to host the project's raw data in a database on Amazon. Amazon Relational Database Service (RDS) is a collection of managed cloud services that let us set up, operate, and scale our database instance in the cloud.
Our choice for the database version was PostgreSQL, a powerful open-source object-relational database system with many years of active development and a strong reputation for reliability, robustness, and performance.
Our code interacted with the database via SQLAlchemy (a Python SQL toolkit and Object Relational Mapper library). This provided an engine connection to upload, store, manipulate, merge, and join various database tables. The PostgreSQL dialect uses psycopg2 as the default Python DBAPI. We were also able to connect to the database remotely via the Postgres tool pgAdmin, which made it easy to view changes.
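A minimal sketch of how such an engine connection might be set up. The user, password, host, and database names are placeholders, and both helper functions are hypothetical; SQLAlchemy is imported only when make_engine is called.

```python
from urllib.parse import quote_plus


def postgres_url(user, password, host, database, port=5432):
    """Build a SQLAlchemy URL for PostgreSQL using the psycopg2 driver.

    The user and password are URL-escaped so that characters such as
    '@' or spaces do not break the connection string.
    """
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def make_engine(url):
    """Create a SQLAlchemy engine; requires sqlalchemy and psycopg2."""
    from sqlalchemy import create_engine  # imported lazily on purpose

    return create_engine(url)


# Placeholder credentials and host, for illustration only.
url = postgres_url("hdb_user", "p@ss word", "example.rds.amazonaws.com", "hdb")
```

In practice the credentials would come from environment variables or a secrets manager rather than being written into the code.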
We enabled PostGIS (an extension in PostgreSQL for storing and managing spatial information) in AWS to consolidate location features. Spatial tables were set up in the database.
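The sketch below shows what enabling PostGIS and running a nearest-distance query could look like. The table and column names (flats, amenities, flat_id, geometry) are hypothetical, and the query assumes the geometries are stored as longitude/latitude (SRID 4326) so that casting to geography yields distances in meters.

```python
# Enables the PostGIS extension on the current database.
ENABLE_POSTGIS_SQL = "CREATE EXTENSION IF NOT EXISTS postgis;"

# For each flat, the distance in meters to the closest amenity.
# Table and column names are hypothetical.
NEAREST_DISTANCE_SQL = """
SELECT f.flat_id,
       MIN(ST_Distance(CAST(f.geometry AS geography),
                       CAST(a.geometry AS geography))) AS nearest_m
FROM flats AS f
CROSS JOIN amenities AS a
GROUP BY f.flat_id;
"""


def run_spatial_queries(engine):
    """Enable PostGIS, then return (flat_id, nearest_m) rows.

    `engine` is a SQLAlchemy engine connected to the PostgreSQL database.
    """
    from sqlalchemy import text  # imported lazily on purpose

    with engine.begin() as conn:
        conn.execute(text(ENABLE_POSTGIS_SQL))
        return conn.execute(text(NEAREST_DISTANCE_SQL)).fetchall()
```

A cross join compares every flat with every amenity, so for large tables a spatial index and a nearest-neighbor query would likely be needed instead.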
The vast number of database tables created is shown below:
Database: insert a bunch of info here
All data is housed in the Amazon RDS Postgres database, and updates to database tables are pushed periodically.
Some additional areas we plan on researching and investigating:
Modified Distance - Potentially replacing or supplementing the straight-line geospatial distance features with a 'Manhattan' form (i.e. taxi-cab / city-block geometry). Singapore is a very walkable city, and the trip from an HDB flat to a nearby feature location (such as a hospital) is often only possible via sidewalks or city streets. We have also considered adding a future feature for the total travel time (whether walking or driving) from the flat to a specific destination.
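One way the 'Manhattan' distance could be approximated is sketched below: the north-south and east-west legs are measured separately and added. This assumes streets run along those two axes and uses an equirectangular approximation, which is reasonable over an area as small as Singapore; the function name and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def manhattan_km(lat1, lon1, lat2, lon2):
    """Approximate city-block distance in kilometers between two points.

    Adds the north-south leg to the east-west leg, with the east-west
    leg scaled by the cosine of the mean latitude.
    """
    mean_lat = math.radians((lat1 + lat2) / 2)
    north_south = EARTH_RADIUS_KM * abs(math.radians(lat2 - lat1))
    east_west = (EARTH_RADIUS_KM * abs(math.radians(lon2 - lon1))
                 * math.cos(mean_lat))
    return north_south + east_west


# Hypothetical flat and destination, for illustration only.
block_distance = manhattan_km(1.30, 103.80, 1.31, 103.81)
```

The result is never shorter than the straight-line distance between the same two points, so it gives a more conservative estimate of how far a resident actually has to travel.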
Crime Data - Overlaying another set of features based on historical crime (similar to New York City’s statistics) is an option. Although crime is extremely low in Singapore, it may be interesting to see whether it is a factor in resale prices.
PLH (Prime Location Housing) - PLH is a recently launched housing scheme with more restrictions on resale. It covers scenarios where there are currently no HDB flats in an area and the flats come directly from HDB rather than from a resale transaction. The concept is that the machine learning model could estimate what a resale might fetch, allowing a mapping from the HDB price to an expected appreciation amount. Under this new model for public flats in prime areas, owners of BTO (Build-To-Order) flats face a 10-year minimum occupation period, the flats are priced with additional subsidies, and those who sell their BTO units must pay back a percentage of the resale price to HDB. The resale buyer criteria for these units will be tighter than for typical resale units…
Unsupervised Clustering - Deeper dive into clustering
Interactivity - Plan to investigate adding more features to the front-end application, including A and B and C
Work Breakout was the following:
Michael - insert
Stuart - insert
Tom - insert
ArcGIS view, if needed, allowing layers to be viewed for familiarity with the area: LINK:
click layers on left-hand-side for filtering
example of initial dataset observation: